1 Introduction

Discrete-event simulation is an invaluable tool for the design and analysis of complex systems such
as factories, transportation networks, computer systems, and communication networks. Large-scale
simulations take a long time to execute, and for this reason many researchers are interested in
parallelizing their execution. One of the key issues is synchronization between processors, as the
synchronization demands are highly variable, depending dynamically on the simulation model's
state. Recommended introductory surveys on the topic are found in [2] and [11].

In a series of previous papers [5, 9, 10] we developed the notion of using uniformization as the
basis for synchronizing the parallel simulation of continuous-time Markov chains (CTMCs). CTMC
models are important, appearing frequently in the study of computer and communication systems.
Uniformization exploits the mathematical structure of these models, making it possible to
pre-compute instants in simulation time where Logical Processes (LPs) ought to synchronize. The
decision whether an LP actually influences another at one of these instants is left until
run-time. Conceptually, a simulation is performed in three phases. In the first phase, the
simulation model is partitioned into LPs, which are mapped to processors. All simulation activity
associated with an LP is assumed to be performed by its assigned processor. In the second phase
one randomly generates synchronization points; in the third phase one simulates a mathematically
correct sample path through those points. We call the general method PUCS, for Parallel
Uniformized Continuous-time Simulation.

We have developed five different variations of PUCS that differ in their treatment of LP
aggregation, communication management, use of optimism, and generation of communication schedules:

• Conservative Aggregated PUCS (CA-PUCS),

• Conservative Partitioned PUCS (CP-PUCS),

• Optimistic PUCS (Opt-PUCS),

• Adaptive Conservative Aggregated PUCS (ACA-PUCS),

• Adaptive Conservative Partitioned PUCS (ACP-PUCS).

CA-PUCS uses no optimism, and treats the entire submodel assigned to a processor as a single LP.
Synchronization between LPs is thus equivalent to synchronization between processors. CP-PUCS
also eschews optimism, but permits a processor's submodel to be viewed as a collection of LPs that
are resident on the same processor. Opt-PUCS also allows multiple LPs per processor, synchronizes
optimistically, and uses techniques to reduce state-saving and perform smart on-processor
scheduling. These techniques are made possible by the basis in uniformization. ACA-PUCS is like
CA-PUCS except that it attempts to reduce some overheads associated with synchronization and
to require the modeler to know less about the simulation model. Similarly, ACP-PUCS is an adaptive
version of CP-PUCS.

Each of these methods has strengths and weaknesses that are alternately revealed by problem
characteristics. The object of this paper is to give an overview of uniformization-based
synchronization, and to examine these different methods empirically on the Intel Touchstone Delta
multiprocessor [7], using up to 256 processors.

The remainder of the paper is organized as follows. Section 2 gives an overview of direct
Markovian simulation, and uniformization. Section 3 introduces each method and its rationale.
Section 4 presents and analyzes our experimental results, and Section 5 gives our conclusions.

2 Uniformization-Based Synchronization 

This section reviews direct Markovian simulation and uniformization. Following the descriptions
we illustrate them concretely with an example.

Let us first review some basic elements of continuous time Markov chains. Readers unfamiliar
with CTMCs are encouraged to consult Ross [12] for a more complete and exact introduction to the
topic. Let $\{X(t)\}$ be a stochastic process where $X(t)$ is the state of the CTMC at time $t$.
For the purposes of description, $X(t)$ is taken to be a nonnegative integer; in practice it is
often more natural to describe $X(t)$ as a vector of integers, e.g., the vector of queue lengths
in a network. Upon entering a state $s$, the CTMC remains in that state for a random period of
time (the holding time), drawn from an exponential distribution with state dependent rate
$\lambda(s)$, the holding rate of state $s$. At the end of the holding time, the CTMC randomly
makes a transition, jumping to some state $s'$. It is convenient to think of this jump as choosing
a winner among all possible jumps, in the following way. While in state $s$ the chain is
attempting to make a jump to every state reachable from $s$, simultaneously. It is as though there
are a large number of stochastic processes, one for each state distinct from $s$, that are all
concurrently active. Each of these processes has an exponentially distributed holding time; the
rate for the process attempting to jump to $s'$ is some $q_{ss'}$, and we note that
$\lambda(s) = \sum_{s' \ne s} q_{ss'}$. We may imagine that each of these holding times is randomly
sampled; the time and nature of $\{X(t)\}$'s transition out of $s$ is determined by the process
whose transition time is least among all possibilities. The probability that the exponential
associated with a given state $s'$ is least among its peers is just
$p_{ss'} = q_{ss'}/\lambda(s)$, also known as a transition probability.

Observe that we may equally think in terms of $\{X(t)\}$ simultaneously attempting jumps to one of
a number of sets of states. For example, we might partition the state-space into two sets $A$ and
$B$, and interpret a transition as a competition between all transitions to states in $A$ and all
transitions to states in $B$. This interpretation will be particularly useful when $A$ is the set
of transitions that do not affect another LP and $B$ is the set of transitions that do.

A direct simulation of a CTMC consists of sampling holding times, and choosing transitions, as
follows. Upon entering state $s$, one advances the simulation clock by an exponential with rate
$\lambda(s)$, essentially simulating the duration of the time spent in state $s$. To choose a
transition it is not necessary to choose the least of a large number of exponentials. It suffices
to construct the transition distribution by computing the rates, and to sample from that
distribution.

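
The direct simulation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the paper: the function `simulate_ctmc`, the callback `rates`, and the single-server example `mm1_rates` (with its arrival and service rates) are hypothetical names and values chosen for the sketch.

```python
import random

def simulate_ctmc(rates, state, t_end, rng):
    """Directly simulate a CTMC from `state` until simulation time t_end.

    rates(s) must return a dict {s_next: q} of outgoing transition rates.
    Returns the visited (time, state) pairs.
    """
    t = 0.0
    path = [(t, state)]
    while True:
        q = rates(state)
        total = sum(q.values())        # holding rate lambda(s)
        if total == 0.0:               # absorbing state: nothing can happen
            break
        t += rng.expovariate(total)    # exponential holding time in s
        if t > t_end:
            break
        # Sample s' with probability q_ss' / lambda(s).
        targets = list(q)
        state = rng.choices(targets, weights=[q[s] for s in targets])[0]
        path.append((t, state))
    return path

def mm1_rates(n, lam=1.0, mu=2.0):
    """Hypothetical example: a single-server queue with n jobs present."""
    out = {n + 1: lam}                 # arrival
    if n > 0:
        out[n - 1] = mu                # service completion
    return out

path = simulate_ctmc(mm1_rates, 0, 50.0, random.Random(1))
```

Sampling one holding time at rate λ(s) and then one transition from the distribution q/λ(s) is statistically equivalent to racing one exponential per reachable state, but costs two random draws per event instead of one draw per competitor.
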
Uniformization of a CTMC is a mathematical device (originally used to simplify numerical
solution [4]) designed so that every holding time is drawn from the same distribution. The basic
idea is to find a uniformization rate $\lambda_{\max}$ such that $\lambda(s) \le \lambda_{\max}$
for every state $s$. All holding times are sampled from the exponential distribution with rate
$\lambda_{\max}$. However, to make the uniformized chain stochastically identical to the original
chain, we introduce transitions back to the same state. In the uniformized chain, the probability
of making a transition from $s$ to $s'$ ($s' \ne s$) is $q_{ss'}/\lambda_{\max}$. The probability
of making a transition back to $s$ is $1 - \lambda(s)/\lambda_{\max}$. Transitions of the latter
form are known as pseudo transitions, as they do not affect the state of the Markov chain. The
mathematical basis for uniformization is simply that a geometrically distributed sum (with mean
$1/p$) of i.i.d. exponential random variables (each with mean $1/\mu$) is itself an exponential,
with mean $1/(p\mu)$. Whenever the original chain enters state $s$, its holding time is
exponential with rate $\lambda(s)$. Now suppose the uniformized chain (at rate $\lambda_{\max}$)
enters state $s$; the number of transitions made up to and including the one that actually leaves
$s$ is geometrically distributed with mean $\lambda_{\max}/\lambda(s)$, and the distribution of
time spent in $s$ before leaving is that of a geometrically distributed sum of exponentials, each
with mean $1/\lambda_{\max}$. The effective distribution of time the uniformized chain spends in
state $s$ is therefore exponential with mean $1/\lambda(s)$, just as in the original chain.
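
The claim that pseudo transitions leave the holding-time distribution unchanged is easy to check numerically. The following is a minimal sketch, assuming a single state with holding rate `rate` and a uniformization rate `rate_max`; the function name and the numeric values in the usage note are illustrative, not taken from the paper.

```python
import random

def uniformized_holding_time(rate, rate_max, rng):
    """Time spent in a state whose true holding rate is `rate` when every
    transition, real or pseudo, is spaced by an Exp(rate_max) interval."""
    if not 0.0 < rate <= rate_max:
        raise ValueError("need 0 < rate <= rate_max")
    t = 0.0
    while True:
        t += rng.expovariate(rate_max)       # uniformized holding time
        if rng.random() < rate / rate_max:   # real transition: leave the state
            return t
        # otherwise a pseudo transition: the state is unchanged, keep going
```

Averaging many samples of, say, `uniformized_holding_time(2.0, 10.0, rng)` should give a value close to 1/2, the mean of the original exponential with rate 2, even though on average four of every five transitions are pseudo.
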

Let us now apply these ideas to a specific example. Consider a type of queue that has $K$ servers,
a Poisson source process with rate $\lambda$, and a service distribution that is a probabilistic
mixture of exponentials: with probability $p_f$ the service time is exponential with a fast rate
$\mu_f$, and with complementary probability the service time is exponential with a slow rate
$\mu_s < \mu_f$. Now imagine a queueing network with three such queues. We suppose that every
departing job exits the system with probability $p_d$; conditioned on not departing the system,
the job rejoins the same queue with probability $p_r$, and otherwise joins either of the other
queues with equal probability. The state of one queue, say $i$, in this system can be described by
a vector $s_i = (N_i, F_i, S_i)$, where $N_i$ gives the total number of jobs in residence at the
queue, $F_i$ gives the total number of fast jobs in service, and $S_i$ gives the total number of
slow jobs in service. In the absence of any job transfers from other queues, the transition rate
out of $s_i$ is $\lambda_i(s_i) = \lambda + F_i\mu_f + S_i\mu_s$. The state of the entire system
is the concatenation $s = (s_1, s_2, s_3)$, with total transition rate
$\lambda(s) = \lambda_1(s_1) + \lambda_2(s_2) + \lambda_3(s_3)$.

Under an ordinary direct simulation of the Markov chain, the system remains in a given state
$s$ for an exponentially distributed period of time with rate $\lambda(s)$. After the holding
time, the chain makes one of several possible transitions, chosen randomly. Transition due to a
source arrival at queue $i$ is chosen with probability $\lambda/\lambda(s)$, while transition due
to a fast (alt., slow) service completion at queue $i$ is chosen with probability
$\mu_f F_i/\lambda(s)$ (alt., $\mu_s S_i/\lambda(s)$). Following simulation of the chosen
transition and its effect on $s$, a new holding time is chosen based on the new state, and the
simulation process continues.

An alternative form of direct simulation is more closely related to how we do parallel simulation.
Let us now view the system as three interacting Markov chains, each one simulated directly. This
is equivalent to partitioning all transitions into three classes, grouping together all
transitions that are initiated at a common queue. We maintain a simulation clock $t_i$ for each
queue $i$, reflecting the end of the queue's current holding time. To select the next event to do
in the system we first select the queue $i$ whose time $t_i$ is least (recall the interpretation
of a transition in terms of competing exponentials). We then directly simulate that queue,
choosing a source arrival with probability $\lambda/\lambda_i(s_i)$, a fast job departure with
probability $F_i\mu_f/\lambda_i(s_i)$, and a slow job departure with probability
$S_i\mu_s/\lambda_i(s_i)$. If a job departure is chosen, then with probability $p_d$ the job
leaves the system. If the job does not leave the system, then with probability $p_r$ the job
rejoins the same queue. Failing this, the job is routed to one of the other two queues, with equal
probability. Queue $i$ now has a new state $s_i'$; its new next transition time is chosen by
adding $t_i$ to an exponential random variable with rate $\lambda_i(s_i')$. Observe that if the
event caused a job to be routed to queue $j \ne i$, then queue $j$ has a new state $s_j'$. We
compute a new next transition time for queue $j$ by adding $t_i$ (not $t_j$!) to an exponential
random variable with rate $\lambda_j(s_j')$. Also observe that if the event does not route a job
to another queue, then the distributions of the holding times of the other queues are unaffected
by the event and, by the memoryless property of the exponential distribution, do not need to be
resampled prior to selecting the next event. They could be resampled, but the resulting chain
would be probabilistically identical to the one where we do not.

This description suggests that one might directly simulate each queue on a separate processor,
provided one can accommodate the instances when jobs flow between queues. It is convenient
to view the behavior of a given queue as the superposition of an internal stream of events, and
a set of external streams. The internal stream comprises all events that do not directly
affect the state of another queue: Poisson source arrivals, departures that leave the system, and
departures that return immediately to the same queue. We have one (outgoing) external stream
associated with each other queue; such a stream comprises all transitions that send a job to
the associated queue. Now for each queue $i$ we maintain a next internal transition time $I_i$ and
two next outgoing external transition times $E_{ij}$ and $E_{ik}$, $j, k \ne i$. The queue's next
transition time is the minimum, $t_i = \min\{I_i, E_{ij}, E_{ik}\}$. This is simply another
application of the "competing processes" interpretation of a transition. When queue $i$ is chosen
for the next transition, the event is taken from the stream whose next transition time is $t_i$.
After simulating that event so that the queue enters state $s_i'$, new next transition times are
chosen for all streams associated with the queue. The reason for changing every stream's next
transition time is apparent from the rates of these streams' holding times. The holding time rate
for the internal process from state $s_i = (N_i, F_i, S_i)$ is

$$\lambda_i^I(s_i) = \lambda + \bigl(p_d + (1 - p_d)\,p_r\bigr)(\mu_f F_i + \mu_s S_i), \qquad (1)$$

while the holding time rate for either external process is

$$\lambda_i^E(s_i) = 0.5\,(1 - p_d)(1 - p_r)(\mu_f F_i + \mu_s S_i). \qquad (2)$$

Equation 1 reflects arrivals to queue $i$ (at rate $\lambda$), and service completions that either
depart the system or are routed back to queue $i$. (Note that the total rate at which jobs in
queue $i$ are receiving service is $\mu_f F_i + \mu_s S_i$, a fraction $p_d$ of which depart the
system and a fraction $(1 - p_d)\,p_r$ of which are routed back to queue $i$.) Both of these rates
depend on the state of the queue; any event at that queue may change its state, and hence change
the correct distribution for the next event on each stream. New next transition times are computed
by sampling from the new holding time distribution, and adding to the time of the event, $t_i$. If
an external event is chosen, then new holding times must also be chosen in like fashion for all
streams of the queue receiving the departing job.
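
Equations 1 and 2 translate directly into code, and doing so gives a quick consistency check: the internal rate plus the two external rates must equal the queue's total transition rate λ + μ_f F_i + μ_s S_i. The sketch below assumes a small parameter record; the class `QueueParams`, its field names, and the function names are illustrative choices, not names from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueueParams:
    lam: float    # Poisson source rate (lambda)
    mu_f: float   # fast service rate
    mu_s: float   # slow service rate
    p_d: float    # probability a departing job leaves the system
    p_r: float    # probability a remaining job rejoins the same queue

def service_rate(p, F, S):
    """Total rate at which jobs in service complete, given F fast and S slow."""
    return p.mu_f * F + p.mu_s * S

def internal_rate(p, F, S):
    """Equation 1: arrivals, system departures, and returns to the same queue."""
    return p.lam + (p.p_d + (1.0 - p.p_d) * p.p_r) * service_rate(p, F, S)

def external_rate(p, F, S):
    """Equation 2: rate of transfers to ONE of the two other queues."""
    return 0.5 * (1.0 - p.p_d) * (1.0 - p.p_r) * service_rate(p, F, S)
```

For any state, `internal_rate(p, F, S) + 2 * external_rate(p, F, S)` equals `p.lam + service_rate(p, F, S)`, because the three routing fractions p_d, (1 - p_d) p_r, and (1 - p_d)(1 - p_r) sum to one.
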

Note that the transition rates above depend only on the number of fast and slow jobs in service.
As such, these rates are independent of the queueing discipline, whose effect is manifested only
in the definition of the state transformation upon a job departure or arrival. This is important
to remember, as it implies that our synchronization algorithm is also independent of the queueing
discipline.

The problem remains that the instants when jobs leave one queue for another are erratic and
unpredictable. We approach the problem by uniformizing every external event stream. Why?
Because the holding time distribution of an external event stream then remains completely
independent of any state changes that may occur at the queue. We can completely presample the
holding times of all uniformized external event streams. Embedded in these transition times are
the real ones, where jobs move between queues. We do not know which of these transitions will
actually move jobs, and will not know until the simulation is actually performed, the queue states
are known, and the real/pseudo decision thresholds are computable. The beauty of the method is
that the queues can generate and exchange their external transition times, and then use these
times as synchronization points, a.k.a. "appointments" [8]. Queue 1 presamples the potential
transition times for its external event streams to queue 2 and queue 3. These potential times are
sampled from Poisson processes whose rates, $\lambda_{12}$ and $\lambda_{13}$, are at least as
large as the maximum possible instantaneous rates at which queue 1 can send jobs to queue 2 and
queue 3. For example, $\lambda_{12} = 0.5\,(1 - p_d)(1 - p_r)\,K\mu_f$, which represents the rate
at which jobs flow from queue 1 to queue 2 when all $K$ servers on queue 1 are busy serving in the
fast phase. By Equation 2, $\lambda_{12}^E(s_1) \le \lambda_{12}$ for all possible states $s_1$.
After queue 1 presamples these potential external transition times, it sends those lists to queue
2 and queue 3. Each queue $i$ receives from every other queue $j$ lists of times at which a job
might be sent from $j$ to $i$. These lists are merged with queue $i$'s own lists of times when it
may send jobs to other queues. The $n$th entry in the merged list for queue $i$ is of the form
$(T_i(n), C_i(n))$, where $T_i(n)$ is the time of the $n$th event and $C_i(n)$ is the type of the
$n$th event, i.e., $C_i(n) = (i, j)$ or $(j, i)$, depending on whether the potential job goes from
$i$ to $j$ or vice versa. Having done so, each queue knows every time at which some other queue
may affect it, and at which it may affect some other queue. Without uniformization the
synchronization times are unpredictable; with uniformization they are completely determined in
advance of actually running the simulation.
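
The presampling and merging steps can be sketched as follows. This is a single-process illustration under stated assumptions: the paper exchanges the lists between processors by message passing, whereas here all streams simply live in one dictionary keyed by (source, destination). The function names and the dictionary layout are hypothetical.

```python
import heapq
import random

def uniformization_rate(K, mu_f, p_d, p_r):
    """Upper bound on the transfer rate to one neighbour: all K servers
    busy in the fast phase (the bound used for lambda_12 in the text)."""
    return 0.5 * (1.0 - p_d) * (1.0 - p_r) * K * mu_f

def presample(rate, t_start, t_end, rng):
    """Potential transition times of a Poisson process with the given
    uniformization rate, restricted to the interval (t_start, t_end]."""
    times, t = [], t_start
    while True:
        t += rng.expovariate(rate)
        if t > t_end:
            return times
        times.append(t)

def merged_list(i, streams):
    """Build queue i's list of (T_i(n), C_i(n)) pairs in time order.

    streams maps (src, dst) to that stream's sorted potential times; only
    streams that start or end at queue i are merged."""
    tagged = ([(t, c) for t in ts] for c, ts in streams.items() if i in c)
    return list(heapq.merge(*tagged))
```

Each per-stream list is already sorted, so `heapq.merge` produces the merged appointment list without re-sorting. Restricting `presample` to an interval also covers the case where the lists are generated one time window at a time.
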

As we simulate in parallel, each processor will execute asynchronously of the others, except
for synchronization at the pre-arranged instants in time. For example, suppose that the state of
queue $i$ is $s_i$, that the last event at queue $i$ occurred at time $t_i$, and that $T_i(n)$ is
the time of the next (potential) external event. An exponential holding time $E_i$ with mean
$1/\lambda_i^I(s_i)$ is generated. If $t_i + E_i < T_i(n)$, then the next event to occur on queue
$i$ is an internal event. In this case, among all possible internal transitions, queue $i$ chooses
one with probability proportional to its transition rate, simulates it, and updates its clock to
time $t_i + E_i$. If $t_i + E_i > T_i(n)$, then the next event to occur at queue $i$ is an
external event. Suppose that $C_i(n) = (i, j)$. Then queue $i$ decides whether the transition is
pseudo or real by computing the ratio $r = \lambda_{ij}^E(s_i)/\lambda_{ij}$ (the ratio of the
stream's current actual rate to the stream's uniformized rate), opting for a real transition if a
uniform $U(0, 1)$ random variable is less than or equal to $r$. In this case queue $i$ selects a
job whose service completes, selecting any particular job with probability proportional to the
rate at which that job is departing for queue $j$ ($0.5\mu_f$ or $0.5\mu_s$, depending on whether
the job is fast or slow). Queue $i$ sends a message to queue $j$ specifying the job transfer and
continues. If $C_i(n)$ is judged to be a pseudo (with probability $1 - r$), then queue $i$ sends a
message to queue $j$ reporting this fact. Alternatively, if $C_i(n) = (j, i)$ then queue $i$ waits
for the message from queue $j$. If queue $j$ reports a job arrival, then queue $i$ simulates the
arrival. If queue $j$ reports a pseudo then the event does not affect queue $i$'s state. Following
simulation of $C_i(n)$, queue $i$ advances its clock to time $T_i(n)$. A new holding time for the
internal process is selected, and the process continues.
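
The two decisions an LP makes at each step (internal versus external, then real versus pseudo) can be sketched as below. The function names are hypothetical, message passing is omitted, and the sketch assumes the internal rate is strictly positive so that an exponential holding time can be sampled.

```python
import random

def next_step(t_i, internal_rate, T_next, rng):
    """Decide what queue i does next.

    Samples an internal holding time E_i with mean 1 / internal_rate and
    compares t_i + E_i with the next appointment time T_next. Returns the
    pair (kind, new_clock)."""
    e_i = rng.expovariate(internal_rate)
    if t_i + e_i < T_next:
        return "internal", t_i + e_i
    return "external", T_next

def is_real(actual_rate, uniformized_rate, rng):
    """Real/pseudo decision for an outgoing appointment: the transition is
    real with probability r = actual_rate / uniformized_rate."""
    if actual_rate > uniformized_rate:
        raise ValueError("uniformization rate is not an upper bound")
    r = actual_rate / uniformized_rate
    return rng.random() <= r
```

After an internal event the caller would update the queue state and call `next_step` again with the new internal rate. Discarding the unused holding time when an appointment comes first is harmless because of the memoryless property.
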

Observe that the description above serves to describe a general algorithm, if we merely replace
the word "queue" with "LP". Also observe that it is possible to define windows $[t, t + \Delta]$
in simulation time. One generates and exchanges all uniformized external events that fall within
the window, simulates the system behavior through that period, then advances to the next window
$[t + \Delta, t + 2\Delta]$. The only limitation on the window size $\Delta$ is the memory storage
necessary to hold the external transition times.

Calculation of uniformization rates is always application dependent. Among all features of the
algorithm, this is one of the issues demanding the most attention by the modeler to the
synchronization algorithm. (The other major such issue is decomposing event streams into internal
and external streams.) It is possible, for example, for LPs to be defined so that jobs from an
infinite server queue are routed to different LPs. One cannot bound the transition rate of such an
external stream, at least not in an open system. The method works best when every stream's
uniformization rate is very close to its actual job transfer rate, i.e., when most external events
are real. However, this may not always be the case. Pseudo transitions are the single most
deleterious artifact of the algorithm, because time spent generating, communicating, and
synchronizing upon pseudos is time spent on activity not found in an optimized serial
implementation. All of our PUCS variations were developed to reduce or eliminate sources of pseudo
transitions, or to minimize their effect on performance.

Not all CTMCs are suitable for parallel simulation using PUCS. A key requirement is that one
be able to partition the CTMC into loosely synchronous interacting subchains. Such partitioning
follows intuitively when the CTMC has a basis in a physical domain, because partitioning the
domain often has the desired effect. Nevertheless, the issue of defining suitable LPs automatically
is one that we have not yet addressed.

The details above may seem complex, especially to those with little experience dealing with
CTMC models. However, there is strong reason to believe that PUCS-style synchronization can
be embedded in a parallel simulation package specific to an application class (e.g., a large subset
of RESQ [3] for simulating queueing networks), with all the details of finding legal uniformization
rates being automated.

In the course of experimenting with PUCS we encountered several implementation issues. One
of these concerns external stream list management. On the one hand we can faithfully implement
PUCS as described above. On the other hand, we can avoid list transmission altogether, by having
both ends of an external stream maintain a synchronized random number generator, so that LP $i$
computes the time of the next LP $i$ to LP $j$ synchronization, rather than receiving it. Another
issue is the degree of aggregation one ought to employ when defining LPs. It is possible for the
entire submodel assigned to a processor to be considered as a single LP. It is also possible to
break up the model into more natural LPs, and treat the workload on a processor as a collection of
distinct LPs. Yet another issue is whether to exploit optimism. The uniformization framework
offers some unique optimizations for optimistic processing. Are they worth it? A final issue is
that of adaptivity in uniformization rates. What can one do when a mathematically correct upper
bound is either impossible or so large that almost all external transitions end up being pseudos?
Our various implementations, to be described next, explore these issues.


3 Methods 

We describe five different methods based on uniformization, and give the rationale for each. 

3.1 Conservative Aggregated PUCS 

CA-PUCS (identified simply as PUCS in [5, 9, 10]) was one of the first methods we developed.
In implementation it is almost identical to the description given in the last section. It has the
additional characteristics that the entire submodel assigned to a processor is considered to be one
LP, and that synchronization lists are generated and simulated on a window-by-window basis. The
latter feature is needed for the simple reason that computers' memories can retain only a finite
number of external transition descriptions, and very long runs will require very long transition
lists.

The rationale for aggregating all co-assigned workload into one LP is two-fold. First, a
one-LP-per-processor implementation is much easier to develop than one that allows multiple LPs.
The architecture used in our studies (the Intel family of multiprocessors) supports interprocessor
communication via explicit sends and receives. Receives may be either asynchronous (post a receive
and periodically check on whether the anticipated message has arrived yet) or synchronous (block
until the anticipated message arrives). Furthermore, NX, the operating system of the Intel
iPSC/860 and Touchstone Delta, supports only one process per processor. Any multitasking-like
switching between LPs has to be done at the application layer. By aggregating all of a processor's
workload into one LP we avoid scheduling issues; furthermore, there is no need to buffer incoming
communication at the application layer. When the processor expects message $m$ at time $t$ from
processor $j$, it simply does a synchronous receive until that message materializes. One cannot
use synchronous receives if switching between LPs is necessary. Secondly, massive aggregation
avoids internal pseudo events that may occur when multiple LPs are assigned to one processor. The
problem here is that if uniformization is applied at the LP level, then two LPs on the same
processor synchronize with each other just as though they were assigned to separate processors. We
surely can develop code so that the communication between co-resident LPs is cheap, but we cannot
easily avoid the overhead of generating, communicating, and synchronizing upon a pseudo event. An
important rationale for massive aggregation is to eliminate the possibility of internal
uniformization.

3.2 Conservative Partitioned PUCS 

The other side of the aggregation issue is that massive aggregation can cause artificial blocking.
Events on a processor under CA-PUCS are executed in monotonically increasing time order. If any
piece of a processor's submodel needs a message at time $t$ and that message is not yet present,
the entire processor blocks. However, it may be that another piece of the submodel is free to
continue past time $t$. To block at time $t$ is to cheat oneself of some potential parallelism.

CP-PUCS (identified as PUCSThreads in [9]) allows multiple LPs per processor, and also strives
to reduce the communication overhead of list generation. The principal features of the method are

• LP independence: A processor may manage any number of distinct LPs. In addition, by
appropriate assignment of random number generator seeds, the sample path that is executed
can be made independent of the way in which LPs are assigned to processors.

• Scheduling: At any time, each LP is classified as being ready or blocked, depending on
whether it is free to execute or is waiting for an incoming message. Scheduling consists of
selecting the ready LP with least time-stamp, performing a communication (either a send or
a receive) and simulating until it reaches its next communication instant. If an LP blocks
waiting for a message, a description of that message is stored in a binary search tree. Between
LP activations we probe for any newly received messages, accepting all such and storing them
in the application space. As each new message is processed we examine the search tree to see
if some LP is blocked on this message. If so, the LP is unblocked and placed on the list of
ready LPs.

• List Generation: Every pair of LPs $i$ and $j$ maintain a synchronized random number
generator. This means that LP $i$ can compute for itself the same transition times that LP $j$
computes for the LP $j$ to LP $i$ external stream. While each LP now executes more work by
duplicating the generation of external stream transition times, we avoid having to communicate
and merge the lists. There is an additional advantage in that no window is needed
now to limit the memory usage of external transition times. We simply generate the next
transition time for a stream when it is needed.
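
The list-generation idea can be illustrated with a small sketch: both endpoints of a stream seed an identical generator from the stream's identity, so each can produce the same potential transition times on demand without exchanging any list. The class name, the seed-string format, and the `base_seed` parameter are assumptions made for this illustration; the paper does not specify how seeds are assigned.

```python
import random

class ExternalStream:
    """One endpoint's private copy of the uniformized stream src -> dst."""

    def __init__(self, src, dst, rate, base_seed=0):
        self.rate = rate                 # uniformization rate of the stream
        # Hypothetical seeding scheme: the seed depends only on the stream's
        # identity, so both LPs construct identical generators, regardless
        # of which processor each LP has been mapped to.
        self.rng = random.Random(f"{base_seed}:{src}->{dst}")
        self.t = 0.0                     # time of the last generated instant

    def next_time(self):
        """Generate the next potential transition time only when needed."""
        self.t += self.rng.expovariate(self.rate)
        return self.t

sender_view = ExternalStream(1, 2, rate=3.0)
receiver_view = ExternalStream(1, 2, rate=3.0)
```

Successive calls to `sender_view.next_time()` and `receiver_view.next_time()` return the same sequence. In a fuller implementation the sender's real/pseudo decision would presumably draw from a separate generator, so that the shared stream of appointment times stays synchronized at both ends.
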

Somewhat to our surprise, our previous empirical studies found no real benefit of CP-PUCS over
CA-PUCS. Those studies examined situations in which the deleterious effect of internal pseudos
is the dominant bottleneck to achieving good performance, and thus the benefit of avoiding them
outweighed the benefit of more parallelism. However, as we will see, data in the present paper
show that this is not always the case, and there are situations in which CP-PUCS outperforms
CA-PUCS. We will comment more on this in Section 4.


3.3 Optimistic PUCS 

Opt-PUCS (identified in [9] as OptAU) endows CP-PUCS with optimism. This comes into play
when an LP reaches an incoming communication instant, and the message it is to receive is not
yet present. The LP can optimistically assume that the message will report a pseudo transition
and hence there is no need to wait for it. When the message does finally arrive, if the receiving
LP's guess was correct, then there is no need to roll back. This is an application of the idea of
"lazy reevaluation" explored first in [13]. Otherwise, as with standard optimistic algorithms such
as Time Warp [6], the receiving LP is rolled back to the time of the late message.

The general PUCS framework makes possible some unique optimizations.

• State Certainty: In a general purpose optimistic environment, one can never be certain
whether the next event processed will end up being committed, or will be discarded as a
result of rollback. In Opt-PUCS an LP can sometimes know that its state is sure, that is, that it
will not be rolled back past its present point. The key to this determination is that we know
all instants in simulation time where messages may arrive. If LP $i$ knows it will not receive
any message between times $s$ and $t$, and it knows that its present state is sure (all LPs are
initially sure), then its state remains sure while processing all internal events up to time
$t$. Furthermore, if LP $j$ sends the message at $t$ and was also sure at the time the message
was sent, then the message may be received and LP $i$ remains sure. However, if either LP
$j$ was unsure at time $t$, or if LP $i$ decides to optimistically bypass that communication,
then LP $i$ becomes unsure. In [9] we show how every LP can maintain a Least Sure Time
(LST) that describes the last instant in simulation time when the LP was sure. By simply
appending sure/unsure tags to messages and analyzing these, every LP's LST advances
without extra calculation. Since we may release any state saved at a time less than the LST,
the LST calculation gives us the benefits of the usual GVT (Global Virtual Time; see [6])
calculation, without the additional overhead of actually performing a GVT calculation.

• State-Saving: Optimistic simulations generally save state prior to every event, because as
far as the LP knows, the simulation can in theory be rolled back to any point in simulation
time ahead of the last known GVT. Within the PUCS framework, a rollback can occur only
at some communication instant, hence there is no advantage to saving state before an internal
event. The only time state must be saved is at a communication instant, and then only if the
receiving LP is either unsure or becomes unsure by either receiving an unsure message or
by optimistically bypassing it.


• Scheduling: Our ability to ascertain whether an LP's state is sure permits smarter scheduling
than is usually possible under Time Warp, because we may give highest priority to an LP
with some work to do that we know is sure, and cannot be rolled back. In fact, our studies
in [9] found that a very effective scheduling strategy is one that is averse to state-saving, as
follows. An LP's execution slice is delimited at either end by external communications (either
incoming or outgoing); the execution slice begins by performing a communication, then all
internal work up to (but not including) the next communication is performed. Whether or not
we perform a state-save at the initial communication depends on the present sure/unsure
state of the LP, whether the communication is outgoing or incoming, and whether an incoming
communication is present or unsure. We define four scheduling classes, listed below in decreasing
order of priority.

1. sure LPs that will not save state because the first communication is either an incoming 
message from a sure LP, or is an outgoing message. 

2. unsure LPs whose first communication is either an incoming message from a sure LP, 
or is an outgoing message. 

3. sure LPs that must save state on the first communication, because that communication 
(necessarily incoming) is either not yet present, or was sent by an unsure LP. 

4. unsure LPs that must save state on the first communication, because that communica- 
tion (necessarily incoming) is either not yet present, or was sent by an unsure LP. 
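The four classes can be sketched as a small priority function. This is only an illustration of the classification as listed above, not the scheduler used in the experiments; the names `FirstComm`, `comm_forces_state_save`, `scheduling_class` and `pick_next` are hypothetical, and treating class 2 as "no state-save forced by the first communication" is an assumption drawn from its contrast with class 4.

```python
from enum import Enum


class FirstComm(Enum):
    """Kind of communication that opens an LP's next execution slice."""
    OUTGOING = 1
    INCOMING_FROM_SURE = 2
    INCOMING_FROM_UNSURE = 3
    INCOMING_NOT_PRESENT = 4


def comm_forces_state_save(first_comm: FirstComm) -> bool:
    # Condition shared by classes 3 and 4: the (necessarily incoming)
    # message is either not yet present or was sent by an unsure LP.
    return first_comm in (FirstComm.INCOMING_FROM_UNSURE,
                          FirstComm.INCOMING_NOT_PRESENT)


def scheduling_class(lp_is_sure: bool, first_comm: FirstComm) -> int:
    """Return the class number, 1 (highest priority) through 4 (lowest)."""
    if comm_forces_state_save(first_comm):
        return 3 if lp_is_sure else 4
    return 1 if lp_is_sure else 2


def pick_next(candidates):
    """Pick the LP with the best class from (name, lp_is_sure, first_comm) tuples."""
    return min(candidates, key=lambda c: scheduling_class(c[1], c[2]))


# Hypothetical ready list: three LPs with different states.
ready = [
    ("lp_a", False, FirstComm.INCOMING_NOT_PRESENT),  # class 4
    ("lp_b", True, FirstComm.INCOMING_FROM_UNSURE),   # class 3
    ("lp_c", False, FirstComm.OUTGOING),              # class 2
]
chosen = pick_next(ready)
```

Under this ordering the scheduler prefers any LP that can run without a state-save to any LP that cannot, and within each group it prefers sure LPs.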

One of our aspirations for Opt-PUCS was that it would reduce the cost of pseudo transitions.
While pseudos would still appear logically in the external event streams, the hope was that not
having to communicate them from unsure LPs would lead to some savings. Our initial experiments
showed that this intuition held true, provided that the fraction of pseudo events was very high. For
lesser fractions of pseudos, the overheads of optimism largely cancelled the benefits of optimism.
This observation is also borne out in the new data we present in this paper. One should also bear
in mind that the version we study in this paper is highly optimized. Our previous study suggested
that its performance is as much as a factor of 2 better than standard Time Warp style algorithms.

3.4 Adaptive PUCS 

We developed ACA-PUCS and ACP-PUCS in an effort to deal directly with the problem of excessive
pseudo events. The idea is to observe the behavior of an external stream, uniformize it at a
rate slightly larger than the maximum rate it seems to achieve, and repeat. There are two basic
issues that must be addressed. One is the selection of the uniformization rate, and the other is
dealing with situations where the assumed upper bound on the external stream’s transition rate
actually becomes less than the actual transition rate, an occurrence we call a rate fault.
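The observe, uniformize, repeat cycle can be sketched as follows. This is a minimal illustration under stated assumptions, not the ACA-PUCS implementation: the helper names, the 10% `margin`, and the representation of a stream as windows of observed transition rates are all hypothetical, and the recovery that must follow a rate fault is not shown.

```python
def choose_rate(observed_rates, margin=1.1):
    """Pick a uniformization rate slightly above the largest rate seen so far.

    The multiplicative margin is an illustrative assumption.
    """
    return margin * max(observed_rates)


def is_rate_fault(assumed_bound, actual_rate):
    """A rate fault: the actual transition rate exceeds the assumed bound."""
    return actual_rate > assumed_bound


def adapt(windows, initial_rate, margin=1.1):
    """Run the adaptive cycle over successive observation windows.

    Each window is a list of transition rates the stream actually reached.
    Returns a list of (rate used in the window, whether a fault occurred).
    """
    rate = initial_rate
    history = []
    for window in windows:
        fault = any(is_rate_fault(rate, r) for r in window)
        history.append((rate, fault))
        # Re-select the rate from what was just observed, then repeat.
        rate = choose_rate(window, margin)
    return history


# Hypothetical stream: the second window exceeds the bound chosen after the first.
history = adapt([[1.0, 1.2], [1.1, 2.0], [0.5]], initial_rate=1.5)
```

In this example the bound selected after the first window (about 1.32) is violated when the stream reaches rate 2.0, which is exactly the situation the text calls a rate fault.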

To uniformize a stream at a rate which is not provably an upper bound on its transition rate
is to execute optimistically. Some provision must then be made to recover from faults suffered
when optimistically made assumptions are violated. Our earlier experience with other versions of
PUCS suggested that CA-PUCS was an appropriate point of departure, as it consistently achieved
better performance (on the problems studied) when the fraction of pseudo events was low. The
simplest way to incorporate optimism in CA-PUCS is to checkpoint the entire simulation state at ...

4 Experiments


In this section we present the results of experiments performed on the Intel Touchstone Delta
multiprocessor, using 16, 64 and 256 processors. The Delta is an MIMD architecture based on
the Intel i860 processor chip. Its processors are connected in a mesh network. Communication is
based on circuit switched message passing.

The simulation model we study is that of a fully connected network of central server queueing
clusters [1]. A single central server is illustrated in Figure 1. A job entering the cluster always
visits the CPU queue first. After receiving service there, the job is routed to one of twenty I/O
servers, chosen uniformly at random. Upon entering service, the job chooses a “fast” service rate
of $\mu_f$ with probability $p_f$; it otherwise acquires a “slow” service rate of 1. The job receives an
exponentially distributed amount of service, with mean $1/\mu_f$ or mean 1, depending on whether the
job is fast or slow. Upon its service completion the job returns to the CPU server with probability
$p_c$. Otherwise, some other central server cluster is chosen uniformly at random, and the job is
routed to that cluster’s CPU queue. Throughout our study of PUCS we have used this model, or
another one related to it (where multiple local clusters are attached to each central server). Even
though the model is too simple in and of itself to warrant treatment by parallel simulation, we
use it because it is capable of parametrically representing more complex models. For example, the
model parameter $p_c$ can be used to adjust the computation/communication ratio. The performance
of the synchronization protocol is largely independent of the specifics of the simulation workload.
However, the frequency with which the model communicates and synchronizes obviously affects
performance, and $p_c$ is a simple parametric means of varying workload intensity. Similarly, the
number of jobs circulating in the system is another parameter that affects the workload intensity.
We can control the level of uniformization by adjusting $\mu_f$: the higher it becomes, the faster the
uniformization rates on external streams.

Our study sets $p_c = 0.99$. This implies a healthy computation/communication ratio proportional
to 200 (an average of 100 visits to the CPU and some I/O device before exiting the cluster),
but only in an “optimal” parallel simulation whose only communication costs are those of moving
jobs. The actual ratio will be degraded from this level by uniformization. Because of the relatively
high cost of message-passing, any application running on a machine such as the Delta must have
a respectable computation/communication ratio to achieve respectable speedups. We also fix the
probability of a fast job ($p_f$) to be 0.01. This selection places stress on our algorithms, because
strict uniformization rates must assume that every server is always busy with a fast job, when in
fact, fast jobs rarely appear. Our study fixes the number of central server clusters at 256. This
selection gives us a moderately large simulation model, and also enables us to examine the effects
of managing many LPs (up to 16) on a processor. Finally, we set the CPU service rate to 20, and
the slow I/O job rate at 1. This ensures that in steady-state the distribution of jobs will be more
or less uniform among all queues.
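The figure of 200 can be checked with a short calculation. It assumes, as the model description suggests, that each pass through a cluster consists of one CPU service followed by one I/O service, and that after each I/O completion the job stays in the cluster with probability $p_c$. The symbol $N$, the number of such passes before the job leaves, is introduced here only for illustration.

```latex
\begin{align*}
\Pr[N = n] &= p_c^{\,n-1}\,(1 - p_c), \qquad n = 1, 2, \ldots \\
E[N] &= \frac{1}{1 - p_c} = \frac{1}{1 - 0.99} = 100 \\
E[\text{service completions per departure}] &= 2\,E[N] = 200
\end{align*}
```

So, on average, about 200 internal events are executed for every job that has to be communicated to another cluster.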

The parameters we vary are

• Number of jobs: We examine lightly loaded scenarios, where there are 10 jobs per cluster
(about 0.5 jobs/queue), and heavily loaded scenarios where there are 1000 jobs per cluster
(about 50 jobs/queue).

• $\mu_f$: We examine a fast job rate of 1 (so there is no distinction between fast and slow jobs),
and a fast job rate of 8. The latter selection, coupled with $p_f = 0.01$, induces high rates
of uniformization relative to actual stream transition rates.

• Number of processors: We study our models on 4 x 4, 8 x 8, and 16 x 16 submeshes of
the Delta.

Figure 1: Central server model. $p_c$ is the probability that a job departing an I/O device will return
to the CPU queue.

Every experiment was run long enough so that every processor executes approximately 0.5 
million events. Our primary metric of interest is the event execution rate, which measures the rate 
at which useful events are executed (per second). We specifically exclude from this rate pseudo 
events and optimistically executed events that are later rolled back. The rates we present are from 
single runs; this is justified, as in our experience there is very little variation (perhaps 1%) in these 
execution rates between runs of the same model. 

While simple, the model we study presents a challenge to any performance oriented study,
especially of a conservative synchronization algorithm. There is virtually no locality; a cluster is no
more likely to communicate with a co-resident one than it is to communicate with an off-processor
one. Every cluster communicates with every other cluster, so there are approximately $2^{16}$ distinct
communication paths to manage! Using $P$ processors, every time a communication occurs there is
a $100(P-1)/P$% chance that the communication is between different processors. Furthermore, the
model uniformization is consistent with a queueing policy where newly arriving fast jobs preempt
slow jobs. The only benign assumption made is that $p_c = 0.99$, an assumption needed to ensure
a sufficient computation/communication ratio. Finally, the maximal processor size, $P = 256$, is
... execution on a difficult problem proves the validity of our methods.

Speedup

Before analyzing the results of our experiments, we address the issue of speedup. ...
Table 1 below gives serial execution rates as a function ... interest change. To illustrate
the point, ...


(load, $\mu_f$)    Optimized Serial Algorithm    PUCS on One Processor
(light, 1)                   6211                        7014
(heavy, 1)                   6563                        7706
(light, 8)                   6219                        4166
(heavy, 8)                   6554                        6469

Table 1: Execution rates (events/sec) of the optimized serial algorithm and PUCS running on one
processor.

... it is slower by 33% on another. By comparison, the optimized serial algorithm varies by only a
few percent over these problems. A user is far more likely to choose a serial algorithm that is
consistently good over one whose performance varies so widely.
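The variation described here can be recomputed directly from the Table 1 figures. The sketch below is only a check on that arithmetic; the variable and function names are illustrative.

```python
# Execution rates (events/sec) copied from Table 1, keyed by (load, fast job rate).
serial = {("light", 1): 6211, ("heavy", 1): 6563,
          ("light", 8): 6219, ("heavy", 8): 6554}
pucs_one_proc = {("light", 1): 7014, ("heavy", 1): 7706,
                 ("light", 8): 4166, ("heavy", 8): 6469}


def relative_change(rate, baseline):
    """Fractional change of rate with respect to baseline (negative means slower)."""
    return (rate - baseline) / baseline


# PUCS on one processor relative to the optimized serial algorithm.
changes = {key: relative_change(pucs_one_proc[key], serial[key]) for key in serial}

# Spread of the serial algorithm's own rate across the four problems.
serial_spread = relative_change(max(serial.values()), min(serial.values()))

for key, change in changes.items():
    print(key, f"{100 * change:+.1f}%")
print(f"serial spread: {100 * serial_spread:.1f}%")
```

On the (light, 8) problem the one-processor PUCS rate is about a third lower than the serial rate, while the serial algorithm's own rate moves by under 6% across all four problems.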

Table 2 presents the results of our experiments. Without resorting to a definition of speedup,
we can say that on the heavily loaded problem with $\mu_f = 1$ using 256 processors, CP-PUCS is
260 times faster than the particular serial simulator we used, and is 221 times faster than its
own one processor implementation (and 14 times faster than its 16 processor implementation). In
this case it is clear that a very substantial improvement over serial execution is being achieved.
As an additional point of comparison, we measured the execution rate of the commercial queueing
network simulator RESQ [3], executing on an IBM 3090 mainframe. The RESQ model
is actually substantially smaller than this one, having only 16 clusters. The RESQ execution rate
is ... events/sec. Of course, one must take into account that RESQ is an industrial quality
simulator able to handle a wide range of problems, whereas the PUCS code is handcrafted and
optimized, with a much more restrictive domain. Nevertheless, this comparison illustrates parallel
simulation’s tremendous potential for accelerating solution times.
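The three factors quoted for CP-PUCS follow from the tabulated rates for the heavily loaded problem with fast job rate 1. The sketch below reproduces that arithmetic; the variable names are illustrative, and the figures are copied from Tables 1 and 2.

```python
# Heavy load, fast job rate 1 (events/sec).
cp_pucs = {16: 122_186, 256: 1_709_567}   # CP-PUCS rows of Table 2
optimized_serial = 6_563                  # Table 1, optimized serial algorithm
pucs_one_processor = 7_706                # Table 1, PUCS on one processor

vs_serial = cp_pucs[256] / optimized_serial
vs_one_processor = cp_pucs[256] / pucs_one_processor
vs_16_processors = cp_pucs[256] / cp_pucs[16]

print(f"vs. optimized serial:  {vs_serial:.1f}")
print(f"vs. one-processor run: {vs_one_processor:.1f}")
print(f"vs. 16-processor run:  {vs_16_processors:.1f}")
```

The factor relative to the 16 processor run is close to 14 for a 16-fold increase in processors, which is the basis for reading the result as good scaling on this problem.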

                16 Processors         64 Processors         256 Processors
                light      heavy      light      heavy      light      heavy

Fast Job Rate = 1
CA-PUCS        80,032    102,504    301,765    411,362    985,327  1,575,146
CP-PUCS       109,585    122,186    378,329    393,418  1,043,609  1,709,567
Opt-PUCS      103,707    121,609    343,510    353,873    874,617    855,737
ACA-PUCS       79,711    102,329    311,168    403,380    989,351  1,567,038
ACP-PUCS       78,942    100,877    256,138    327,559    808,621  1,357,975

Fast Job Rate = 8
CA-PUCS        53,339     76,580    181,785    299,660    668,323  1,147,282
CP-PUCS        58,753     90,708    202,205    311,920    457,252    934,120
Opt-PUCS       57,314     89,642    167,382    328,711    445,880    802,732
ACA-PUCS       74,580     90,857    258,018    352,763    851,204  1,304,502
ACP-PUCS       63,738     88,156    203,168    282,835    547,770    926,403

Table 2: Execution rates (events/sec) of fully connected model of 256 central server clusters with
$p_c = 0.99$, $p_f = 0.01$. Fast job service rate is varied between 1 and 8; average number of
jobs per cluster is varied from 10 (light) to 1000 (heavy). Simulation is executed on 16, 64, and
256 processors of the Intel Touchstone Delta.

We next analyze this data with an eye towards addressing the issues of aggregation, communication
costs, optimism, and adaptiveness.

4.1 CP-PUCS vs CA-PUCS 

Our earlier studies of CA-PUCS and CP-PUCS (on an Intel iPSC/2) indicated that the CP- 
PUCS overheads of managing multiple LPs and of internal pseudos between on-processor clusters 
outweighed the advantages of increased opportunity for parallelism and avoidance of synchronous 
appointment generation. Yet the data in the present study shows that this is not always true. 
Consider Table 3, which gives the ratio of CP-PUCS rates to CA-PUCS rates as a function of
problem characteristics and architecture size.

The overall trend is for CP-PUCS to outperform CA-PUCS, but there are still instances where 
the reverse is true. 


(load, $\mu_f$)    16 Processors    64 Processors    256 Processors
(light, 1)             1.37             1.25              1.05
(heavy, 1)             1.19             0.95              1.08
(light, 8)             1.10             1.11              0.68
(heavy, 8)             1.18             1.35              0.81

Table 3: Ratio of CP-PUCS/CA-PUCS execution rates.

CP-PUCS and CA-PUCS differ both with respect to aggregation, and with respect to message
handling. As such, it is difficult to separate the influences of aggregation and communication costs.
Furthermore, the communication costs will depend on the underlying architecture, as well as the
operating system. There are at least four factors to take into consideration, which sometimes
interact in a complex manner.

• An LP’s execution time-slice is delimited by communication instants. When $\mu_f = 8$ the
uniformization rate is eight times larger, so that there are eight times as many communication
instants per unit time. An LP’s execution time-slice is much shorter, so that the overhead of
switching between LPs is suffered eight times as often.

• In the lightly loaded experiments (and those where $\mu_f = 8$), most communications report
pseudo events. Thus, when CA-PUCS blocks, it usually waits for a communication that
doesn’t affect its state. There is thus no useful purpose gained by blocking, other than the
assurance of logical correctness. CP-PUCS is better able to find and execute useful work,
when such work exists.

• As we increase the number of processors we decrease the number of clusters on a processor.
This increasingly limits CP-PUCS’ ability to find useful work that CA-PUCS cannot find. Of
course, at 256 processors, CP-PUCS and CA-PUCS each have one cluster per processor,
and thus behave identically with respect to synchronization.

• CA-PUCS has a global step where synchronization appointments are generated and ex- 
changed. Its performance will thus be affected by the efficiency with which an all-to-all 
exchange can be performed, and by the frequency of this exchange. CP-PUCS has no corre- 
sponding cost. 

Let us examine performance with these factors in mind. On these experiments CP-PUCS tends 
to perform better. Apparently, on this model, the scheduling and appointment generation advan- 
tages outweigh CA-PUCS advantages. The difference between the two tends to diminish as the 
number of processors increases, which is consistent with the fact that (i) the CP-PUCS scheduling 
advantage gets smaller as a processor has fewer and fewer clusters, and (ii) in a CA-PUCS ap- 
pointments exchange, essentially the same communication workload is spread over more network 
hardware, reducing the frequency of collisions and blocking. Thus, as the number of processors in- 
creases the CA-PUCS advantage diminishes and the CP-PUCS disadvantage diminishes. However, 
there are clearly other factors at work, as the performance differences change neither smoothly nor
monotonically as the number of processors increases.

Our earlier comparison of CP-PUCS and CA-PUCS found CA-PUCS to be clearly superior.
One explanation is that the models studied are different in an important way. The earlier model
appends 10 “local clusters” of queues to every central server queue. In those studies, $p_c = 0.0$, and
a job leaving an I/O device can be routed either to another central server cluster (with probability
$p_{cc}$) or to one of its local clusters. Upon leaving the local cluster the job returns to the same
central server. This model provides another way of boosting the computation/communication
ratio, because a local cluster is always mapped to the same processor as its parent central server
cluster. Our previous study varied the probability $p_{cc}$ of routing a job from one central server to
another one, on a different processor. As $p_{cc}$ increases, CP-PUCS performance drops faster than
that of CA-PUCS, because CP-PUCS suffers increasingly from internal pseudo transitions between
a central server and its local clusters. The present set of experiments is somewhat kinder to
CP-PUCS, as the level of interaction between co-resident LPs is much lower. It seems then that
the level of internal uniformization is the deciding factor between CA-PUCS and CP-PUCS. This
implies that close attention must be paid when partitioning a simulation model into LPs for PUCS,
perhaps deciding which style of synchronization to use as a function of uniformization rates.

(load, $\mu_f$)    16 Processors    64 Processors    256 Processors
(light, 1)             1.05             1.10              1.19
(heavy, 1)             1.00             1.10              1.99
(light, 8)             1.02             1.20              1.02
(heavy, 8)             1.01             0.90              1.16

Table 4: Ratio of CP-PUCS/Opt-PUCS execution rates.

4.2 Whither Optimism? 

These experiments offer clear insight into the potential of exploiting optimism in PUCS, because
the only substantive difference between CP-PUCS and Opt-PUCS is the optimistic processing.
Towards this end, Table 4 computes the ratio of CP-PUCS to Opt-PUCS execution rates.

The first thing we notice is that CP-PUCS tends to do a little better than Opt-PUCS. Next
we notice that the degree to which CP-PUCS does better tends to increase as the number of
processors increases. Indeed, for all practical purposes the performance on 16 processors is identical;
yet at 256 processors, in one case CP-PUCS was nearly twice as fast as Opt-PUCS.

Explanations for this behavior are found by looking at the costs suffered by executing optimisti- 
cally, primarily event re-execution and state-saving. Table 5 computes the ratio of the number of 
total events (excluding pseudos) executed to the number of events (excluding pseudos) committed. 
One can also view this as the average number of times a non-pseudo event is executed. The table 
also computes the average number of state-saves per committed non-pseudo event. 
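The two overhead measures just described can be written out explicitly. The sketch below is illustrative only: the `OptCounters` fields are hypothetical names for counters an optimistic run might keep, and the sample values are invented so that the resulting ratios land on a round example, not taken from any measured run.

```python
from dataclasses import dataclass


@dataclass
class OptCounters:
    """Hypothetical per-run counters; pseudo events are excluded throughout."""
    executed: int     # events executed, counting every re-execution
    committed: int    # events that were ultimately committed
    state_saves: int  # state-saves performed


def execution_ratio(c: OptCounters) -> float:
    """Average number of times a committed non-pseudo event was executed."""
    return c.executed / c.committed


def saves_per_event(c: OptCounters) -> float:
    """Average number of state-saves per committed non-pseudo event."""
    return c.state_saves / c.committed


def rolled_back_fraction(c: OptCounters) -> float:
    """Share of executed events whose work was later rolled back."""
    return 1.0 - c.committed / c.executed


# Invented counters for illustration.
sample = OptCounters(executed=2_100, committed=1_000, state_saves=7)
ratio = execution_ratio(sample)
saves = saves_per_event(sample)
wasted = rolled_back_fraction(sample)
```

An execution ratio near 1 means almost no work is repeated, whereas a ratio of 2 means that half of all executed events were eventually rolled back.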

One thing clearly shown is that, in this example, the cost of saving the state of one central 
server cluster (about 3000 bytes) is usually amortized over many events. Its effect on performance 
must be negligible. Any significant differences between CP-PUCS and Opt-PUCS are related to 
the cost of rolling back and re-executing events. Indeed, there is a direct correlation between high 
event execution ratios and significant gaps between CP-PUCS and Opt-PUCS. 

                   Total/Committed Events       Average State Saves/Event
                         Processors                    Processors
(load, $\mu_f$)      16      64     256           16       64      256
(light, 1)         1.11    1.19    1.68        0.008    0.010    0.027
(heavy, 1)         1.03    1.40    2.10        0.001    0.002    0.007
(light, 8)         1.01    1.06    1.34        0.060    0.100    0.017
(heavy, 8)         1.01    1.04    1.27        0.004    0.015    0.041

Table 5: Overheads associated with Opt-PUCS.

Since re-execution costs define the difference between CP-PUCS and Opt-PUCS, it is simple
to explain why the gap between them increases as the number of processors increases. On only 16
processors, many LPs are assigned to the same processor, and thus Opt-PUCS has a
good chance of being able to schedule a sure cluster. However, for a large number of processors there
are relatively few LPs on a processor. Without a large number of LPs, a processor quickly executes
its sure workload and is left to forge ahead optimistically. Apparently its optimism is frequently
misplaced, and significant fractions of events end up being resimulated. This effect is somewhat
lessened when there are many pseudo events, since in such cases the optimistic assumption that
the event is a pseudo event is in fact correct.

4.3 Adaptivity

Pseudo-events are the largest source of performance degradation in all versions of PUCS. Many 
CTMC models have characteristics that cause the best upper bound on an external event stream's 
transition rate to be very far from the stream’s average transition rate. In our experiments fast 
jobs appear infrequently, and one almost never sees more than 3 simultaneous fast jobs in a central 
server cluster. Yet the uniformization bound must be based on the assumption that all servers are 
busy with fast jobs. 
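The cost of a loose bound can be illustrated with a toy uniformized stream. The sketch below is not the PUCS code: it assumes a stream whose actual transition rate is constant, generates candidate instants from a Poisson process at the bound, and marks each candidate as real with probability actual/bound, otherwise as a pseudo event. The function names and the example rates (a bound of 8 against an actual rate of 1, as for a single server that is always busy with a slow job) are illustrative assumptions.

```python
import random


def uniformized_stream(bound_rate, actual_rate, horizon, rng):
    """Return (time, is_real) pairs for one uniformized stream on [0, horizon].

    Candidate instants form a Poisson process with rate bound_rate. A candidate
    is a real transition with probability actual_rate / bound_rate and a pseudo
    event otherwise.
    """
    if actual_rate > bound_rate:
        raise ValueError("bound_rate must be an upper bound on actual_rate")
    events = []
    t = rng.expovariate(bound_rate)
    while t <= horizon:
        is_real = rng.random() < actual_rate / bound_rate
        events.append((t, is_real))
        t += rng.expovariate(bound_rate)
    return events


def pseudo_fraction(events):
    """Fraction of candidate instants that turned out to be pseudo events."""
    if not events:
        return 0.0
    return sum(1 for _, is_real in events if not is_real) / len(events)


rng = random.Random(12345)
stream = uniformized_stream(bound_rate=8.0, actual_rate=1.0, horizon=5_000.0, rng=rng)
fraction = pseudo_fraction(stream)
```

With these rates the expected pseudo fraction is 1 - 1/8, so roughly seven of every eight synchronization instants carry no real transition. In the model itself the actual rate varies with the cluster state, so the true fraction differs from this constant-rate toy.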

Table 6 illustrates the sensitivity of each method to increased uniformization, by computing
the ratio of its execution rate using $\mu_f = 1$ to its rate using $\mu_f = 8$. This data shows clearly
that ACA-PUCS and ACP-PUCS are more tolerant of increased uniformization than are the other
methods (with the exception of ACP-PUCS using 256 processors). Similar observations held in
our previous study of ACA-PUCS that varied $\mu_f$ more widely, up to $\mu_f = 1024$. Even at levels of
$\mu_f = 256$, ACA-PUCS gives respectable performance while CA-PUCS performance has thoroughly
degenerated. We believe that any standardized version of PUCS must include adaptivity if it is to
work on a wide range of problems.
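The sensitivity ratio is a simple quotient of Table 2 entries. As a sketch, the lightly loaded rates for CA-PUCS and ACA-PUCS are copied below and the ratio recomputed; the dictionary layout and function name are illustrative, and recomputed values may differ from the printed ratios in the last digit depending on how those were rounded.

```python
# Light-load execution rates (events/sec) from Table 2, ordered as 16, 64, 256 processors.
light_rates = {
    "CA-PUCS": {1: [80_032, 301_765, 985_327], 8: [53_339, 181_785, 668_323]},
    "ACA-PUCS": {1: [79_711, 311_168, 989_351], 8: [74_580, 258_018, 851_204]},
}


def sensitivity(algorithm):
    """Ratio of the rate at fast job rate 1 to the rate at fast job rate 8."""
    rates = light_rates[algorithm]
    return [r1 / r8 for r1, r8 in zip(rates[1], rates[8])]


ca = sensitivity("CA-PUCS")
aca = sensitivity("ACA-PUCS")

for procs, ca_ratio, aca_ratio in zip((16, 64, 256), ca, aca):
    print(f"{procs:>3} processors: CA-PUCS {ca_ratio:.2f}, ACA-PUCS {aca_ratio:.2f}")
```

A ratio close to 1 means the method loses little when the uniformization bound is raised; for these rows the adaptive variant stays well below its non-adaptive counterpart at every processor count.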

                      Light Load                  Heavy Load
                      Processors                  Processors
Algorithm          16      64     256          16      64     256
CA-PUCS          1.50    1.66    1.47        1.34    1.37    1.37
CP-PUCS          1.86    1.87    2.28        1.34    1.25    1.83
Opt-PUCS         1.80    2.05    1.96        1.35    1.07    1.06
ACA-PUCS         1.07    1.20    1.16        1.12    1.14    1.20
ACP-PUCS         1.23    1.14    1.26        1.16    1.47    1.46

Table 6: Ratio of $\mu_f = 1$ to $\mu_f = 8$ execution rates.

The relatively weak performance of ACP-PUCS surprised us, as we expected it to gain the
advantages of both scheduling flexibility and adaptivity. We have reason to believe that its failure
to do so rests somehow with the Delta architecture and NX operating system, because these
expectations are met using the Intel iPSC/2. Execution rates taken from a 16 processor configuration
are given in Table 7. We see that when $\mu_f = 1$ ACP-PUCS gets very nearly the performance of
CP-PUCS (whose performance is best), while ACA-PUCS does not do as well owing to its basis in
CA-PUCS. Then when $\mu_f = 8$, ACP-PUCS becomes the best method overall.

               Fast Job Rate = 1       Fast Job Rate = 8
Algorithm       light     heavy         light     heavy
CA-PUCS        10,637    13,431         7,788    11,216
CP-PUCS        13,149    15,329         9,258    13,224
ACA-PUCS       10,679    13,212         8,118    10,553
ACP-PUCS       12,975    15,148        11,276    13,935

Table 7: Execution rates on 16 processors of Intel iPSC/2

Regardless of whether ACP-PUCS meets our expectations or not, it is evident that adaptiveness
offers performance gains for $\mu_f = 8$, when the gap between the maximum and average external
transition rates increases.


5 Conclusions 

This paper looked at the problem of parallelizing the simulation of continuous time Markov chains.
We showed how the notion of uniformization can be applied so that the simulation can be conducted
by essentially pre-computing an inter-LP synchronization schedule, and then simulating a
mathematically correct sample path through that schedule. This basic method is called PUCS.
We described five different PUCS variations, and examined performance on a parameterized model
designed to illustrate their respective strengths and weaknesses. The experiments were conducted
on the Intel Touchstone Delta multiprocessor, using 16, 64 and 256 processors.

The results of these experiments, taken in conjunction with others previously conducted, suggest
that an optimized PUCS algorithm ought to incorporate conservative synchronization, and
adaptive uniformization rates. Issues of aggregation and communication seem to be dependent on
the simulation model, and underlying architecture and/or operating system. More work is needed
to fully understand the complex relationships between these factors. The performance we observe
can often be quite good, depending on the problem characteristics. However, PUCS performance
is inescapably dependent on the number of pseudo-events, and every effort must be made to reduce
these.

While our experiments demonstrate the promise of PUCS, some important issues remain open. We
have not yet addressed automated partitioning, nor automated load balancing, nor the effect one
has on the other. We intend to investigate these issues.